PMCSched: Scheduling algorithms made easy
PMCSched was born as a continuation of PMCTrack, an OS-oriented performance monitoring tool for Linux. With the PMCSched framework we take PMCTrack's potential one step further by enabling rapid development of scheduling and resource-management support for Linux within a loadable kernel module. The diagram below illustrates how PMCSched fits into the PMCTrack design.
PMCSched is built on top of PMCTrack, so the first step is installing and building PMCTrack. Instructions for that can be found here. Reading those instructions is highly encouraged, since they include insights on the use of PMCTrack from user space, from the OS scheduler, and through PMCTrack monitoring modules.
Once PMCTrack is installed we can go ahead and start creating our custom scheduling algorithm. A new scheduling or resource management algorithm can be implemented by creating a scheduling plugin. We start by defining our new plugin in PMCSched's main header pmcsched.h. For example, we can declare a new example plugin with ID SCHED_EXAMPLE (added to the enum of policies). Our new plugin also has to be included in the array of available schedulers, in that same header.
pmcsched.h
```c
/* Supported Scheduling policies */
typedef enum {
	(...)
	SCHED_EXAMPLE,
	(...)
	NUM_SCHEDULERS /* Others in the future... */
} sched_policy_mm_t;

extern struct sched_ops example_plugin;

static __attribute__((unused)) struct sched_ops* available_schedulers[NUM_SCHEDULERS] = {
	(...)
	&example_plugin,
};
```
After that, we can start developing our new plugin right away, in a new file example.c. Creating a plugin works similarly to writing a Linux kernel module: the plugin follows a "contract" and implements a number of functions that every plugin should provide, each with specific function attributes and intended to handle a specific event.
example.c
The various algorithm-specific operations are invoked from the core part of the scheduling framework when a key scheduling-related event occurs, such as when a thread enters the system, terminates, becomes runnable/non-runnable, or when tick processing is due to update statistics. The framework also provides a set of callbacks to carry out periodic scheduling activations from interrupt (timer) and process (kernel thread) context on each core group separately, thus making it possible to invoke a wide range of blocking and non-blocking scheduling-related kernel API calls, such as those to map a thread to a specific CPU or core group. This modular approach to creating scheduling algorithms resembles the one used by scheduling classes (algorithms) inside the Linux kernel, but with a striking advantage: PMCSched scheduling plugins can be bundled in a kernel module that can be loaded on unmodified kernels.
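The "contract" described above can be sketched in plain user-space C as a table of function pointers that a core dispatcher invokes when events occur. Everything below (the trimmed-down sched_ops struct, core_dispatch_active(), the counter) is a hypothetical illustration of the pattern, not PMCSched's real API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down ops table mimicking the plugin contract:
 * the framework core calls through these pointers on each event. */
typedef struct sched_ops {
    const char *description;
    void (*on_active_thread)(int tid);
    void (*on_exit_thread)(int tid);
} sched_ops;

static int active_count = 0;

static void example_on_active(int tid) { (void)tid; active_count++; }
static void example_on_exit(int tid)   { (void)tid; active_count--; }

/* The plugin bundles its handlers in one descriptor. */
static sched_ops example_plugin = {
    .description      = "Example plugin",
    .on_active_thread = example_on_active,
    .on_exit_thread   = example_on_exit,
};

/* The core dispatches an "activation" event through the active
 * plugin's table, checking that the handler was provided. */
static void core_dispatch_active(sched_ops *ops, int tid)
{
    if (ops && ops->on_active_thread)
        ops->on_active_thread(tid);
}
```

This indirection is what lets the framework stay generic: the core only knows the descriptor type, and each plugin supplies its own behavior.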
Let's go ahead and prepare the basic events that our scheduling plugin should have logic to react to.
There are also some events that our plugin can optionally track.
The plugin descriptor bundles the minimum required functions along with a policy ID, optional flags, and a string description. That would go into our new file:
```c
sched_ops_t example_plugin = {
	.policy               = SCHED_EXAMPLE,
	.description          = "Example plugin",
	.flags                = PMCSCHED_CPUGROUP_LOCK,
	.sched_timer_periodic = sched_timer_periodic_example,
	.on_active_thread     = on_active_thread_example,
	.on_inactive_thread   = on_inactive_thread_example,
	.on_exit_thread       = on_exit_thread_example,
	.on_migrate_thread    = on_migrate_thread_example,
};
```
Some of the fields of our plugin are conceived to ease development from user space. PMCSched exposes an entry in the /proc filesystem that includes information on the available schedulers, their descriptions, and their IDs. Values can also be echoed into these configuration files to enable or disable verbose mode, and some places within PMCSched print meaningful information when verbose mode is active. In general, PMCSched uses trace_printk() to output insightful debugging information, but plugins can check active_scheduler_verbose before printing information with printk(). Alternatively, plugins can check whether DEBUG was defined (see file pmcsched.c).
Once the plugin is defined and registered, we can stub out the handler functions for these events:
```c
/* =======================================================================
 * Example scheduling plugin for PMCSched
 * Author: Carlos Bilbao
 * ======================================================================= */

#include "pmcsched.h"

static void sched_timer_periodic_example(sized_list_t* migration_list) {}
static void on_active_thread_example(pmcsched_thread_data_t* t) {}
static void on_inactive_thread_example(pmcsched_thread_data_t* t) {}
static void on_exit_thread_example(pmcsched_thread_data_t* t) {}
static void on_migrate_thread_example(pmcsched_thread_data_t* t,
                                      int prev_cpu, int new_cpu) {}
```
We can now start implementing the logic for our example plugin. Let us, for example, implement something that triggers random migrations within groups of applications.
When the task becomes active, function on_active_thread_example() will be called. We insert the thread into the application's list of active threads and, if it is the first one, the application into the current group's list of active apps. We can check whether it is a newly active application and log it to the kernel trace buffer if the DEBUG option is enabled.
```c
static void on_active_thread_example(pmcsched_thread_data_t* t)
{
	sched_thread_group_t* cur_group = get_cur_group_sched();
	app_t_pmcsched* app = get_group_app_cpu(t, cur_group->cpu_group->group_id);

#ifdef DEBUG
	trace_printk("ACTIVE t=%p sched_group=%p app=%p\n", t, cur_group, app);
#endif

	t->cur_group = cur_group;

	/* Point to the right per-group resource monitoring and allocation data */
	t->cmt_data = &app->app_cache.app_cmt_data;

	/* Insert structure in per-application and global lists */
	insert_sized_list_tail(&app->app_active_threads, t);
	insert_sized_list_tail(&cur_group->active_threads, t);

	/* Check if it's a new active application */
	if (sized_list_length(&app->app_active_threads) == 1) {
		insert_sized_list_tail(&cur_group->active_apps, app);
#ifdef DEBUG
		trace_printk("An application just became active\n");
	} else {
		trace_printk("A thread of a multithreaded program just became active\n");
#endif
	}
}
```
When the task becomes inactive, function on_inactive_thread_example() will be called. The logic resembles the activation case, but inverted: we remove the thread's structure from the per-application and global lists:
```c
static void on_inactive_thread_example(pmcsched_thread_data_t* t)
{
	sched_thread_group_t* cur_group = t->cur_group;
	app_t_pmcsched* app = get_group_app_cpu(t, cur_group->cpu_group->group_id);

#ifdef DEBUG
	trace_printk("INACTIVE t=%p sched_group=%p app=%p\n", t, cur_group, app);
#endif

	/* Remove structure from per-application and global lists */
	remove_sized_list(&app->app_active_threads, t);
	remove_sized_list(&cur_group->active_threads, t);

	if (sized_list_length(&app->app_active_threads) == 0) {
		remove_sized_list(&cur_group->active_apps, app);
#ifdef DEBUG
		trace_printk("An application just became inactive\n");
	} else {
		trace_printk("A thread of a multithreaded program just became inactive\n");
#endif
	}

	t->cur_group = NULL;
}
```
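The sized_list_t operations used in these handlers (insert_sized_list_tail(), remove_sized_list(), sized_list_length()) behave like a linked list that tracks its own length. A minimal user-space sketch of those semantics, using a hypothetical array-backed sketch_list_t rather than the kernel implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical array-backed stand-in for sized_list_t: a container
 * of pointers that maintains its element count. */
#define SKETCH_CAP 16

typedef struct {
    void  *items[SKETCH_CAP];
    size_t len;
} sketch_list_t;

/* Append an item at the tail, like insert_sized_list_tail(). */
static void sketch_insert_tail(sketch_list_t *l, void *item)
{
    if (l->len < SKETCH_CAP)
        l->items[l->len++] = item;
}

/* Remove a given item, like remove_sized_list(): find it and shift
 * the remaining entries left, decrementing the length. */
static void sketch_remove(sketch_list_t *l, void *item)
{
    for (size_t i = 0; i < l->len; i++) {
        if (l->items[i] == item) {
            for (size_t j = i + 1; j < l->len; j++)
                l->items[j - 1] = l->items[j];
            l->len--;
            return;
        }
    }
}
```

Tracking the length explicitly is what makes checks such as `sized_list_length(...) == 1` (first active thread of an application) O(1) in the handlers above.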
When the periodic kthread is triggered, function sched_timer_periodic_example() will be called. This is the function that actually implements the scheduling algorithm, using the per-group linked lists and deciding, for example, which threads should migrate at a given time. First, we return early if there are no active applications in the group (since otherwise there is nothing to do), and then we traverse the list of active applications within the group.
```c
static void sched_timer_periodic_example(sized_list_t* migration_list)
{
	char buf[150] = "";
	char* dest = buf;
	static int migration_counter = 0;
	sched_thread_group_t* cur_group = get_cur_group_sched();
	app_t_pmcsched* cur;
	app_t* app;
	pmcsched_thread_data_t *first, *t;
	uint64_t cache_usage;
	uint_t llc_id = cur_group->cpu_group->group_id;

	if (sized_list_length(&cur_group->active_apps) == 0)
		return;

	/* Traverse the list of active applications in this group */
	for (cur = head_sized_list(&cur_group->active_apps); cur != NULL;
	     cur = next_sized_list(&cur_group->active_apps, cur)) {
		first = head_sized_list(&cur->app_active_threads);
		app = &cur->app_cache;
		intel_cmt_update_supported_events(&pmcs_cmt_support, &app->app_cmt_data, llc_id);
		cache_usage = app->app_cmt_data.last_llc_utilization[llc_id][0];
		dest += sprintf(dest, "%d(%d - %llu - %zuT) ",
		                app->process->tgid, app->app_cmt_data.rmid, cache_usage,
		                sized_list_length(&cur->app_active_threads));
	}
	trace_printk("[Group %i]. Active applications (#threads): %s\n",
	             cur_group->cpu_group->group_id, buf);

	/* Every third activation, request a migration to a random group */
	migration_counter++;
	if (migration_counter == 3) {
		migration_counter = 0;
		cur = head_sized_list(&cur_group->active_apps);
		t = head_sized_list(&cur->app_active_threads);
		if (!t->migration_data.state) {
			int cpu_group_count;
			int target_group;
			get_platform_cpu_groups(&cpu_group_count);
			do {
				target_group = get_random_long() % cpu_group_count;
			} while (target_group == llc_id);
			t->migration_data.state = MIGRATION_REQUESTED;
			t->migration_data.dst_group = target_group;
			t->migration_data.src_group = llc_id;
			insert_sized_list_tail(&cur_group->migration_list, t);
			cur_group->activate_kthread = 1;
		} else {
			trace_printk("Attempted remigration for list with %zd items\n",
			             sized_list_length(&cur_group->migration_list));
			cur_group->activate_kthread = 1;
		}
	}
}
```
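The target-group selection above can be isolated and tested in user space. The helper below is a hypothetical stand-in that mirrors the do/while draw, with rand() replacing the kernel's get_random_long():

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space version of the selection loop: draw random
 * group ids until one differs from the thread's current LLC group. */
static int pick_target_group(int current_group, int group_count)
{
    int target;

    if (group_count < 2)
        return current_group; /* nowhere else to migrate */

    do {
        target = rand() % group_count;
    } while (target == current_group);

    return target;
}
```

The rejection loop is a simple way to guarantee the target differs from the source; with a small, fixed number of core groups the expected number of redraws is tiny.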
Finally, when a thread migrates between core groups, on_migrate_thread_example() is invoked. It completes any pending migration request and moves the thread's bookkeeping from the old group to the new one:

```c
static void on_migrate_thread_example(pmcsched_thread_data_t* t,
                                      int prev_cpu, int new_cpu)
{
	sched_thread_group_t* cur_group = get_cur_group_sched();
	sched_thread_group_t* old_group = t->cur_group;
	unsigned long flags;

	if (prev_cpu == -1 || (cur_group == old_group))
		return;

	if (old_group) {
		spin_lock_irqsave(&old_group->lock, flags);
		if (t->migration_data.state) {
			if (t->migration_data.dst_group != cur_group->cpu_group->group_id)
				trace_printk("WARNING: Migration to wrong group\n");
			remove_sized_list(&old_group->migration_list, t);
			t->migration_data.state = MIGRATION_COMPLETED;
		}
		on_inactive_thread_example(t);
		spin_unlock_irqrestore(&old_group->lock, flags);
	}

	spin_lock_irqsave(&cur_group->lock, flags);
	on_active_thread_example(t);
	spin_unlock_irqrestore(&cur_group->lock, flags);
}
```
Finally, remember to add the new plugin's object file to the module's Makefile:

src/modules/pmcs/intel-core/Makefile

```makefile
MODULE_NAME=mchw_intel_core

obj-m += $(MODULE_NAME).o

PMCSCHED-objs= example_plugin.o \
	(...)
```
Arguably, one of PMCSched's coolest features is its ability to collect information from Performance Monitoring Counters (PMCs), using the APIs provided by PMCTrack.
You can configure your plugin to collect certain metrics and events, such as instruction count, cycles, LLC misses and LLC references. This is particularly useful for profiling applications as they enter the system. For example, we can define a set of derived metrics and the PMC events they are computed from:
```c
static metric_experiment_set_t metric_description = {
	.nr_exps = 1, /* 1 set of metrics */
	.exps = {
		/* Metric set 0 */
		{
			.metrics = {
				PMC_METRIC("IPC",       op_rate, INSTR_EVT,      CYCLES_EVT, 1000),
				PMC_METRIC("RPKI",      op_rate, LLC_ACCESSES,   INSTR_EVT,  1000000),
				PMC_METRIC("MPKI",      op_rate, LLC_MISSES,     INSTR_EVT,  1000000),
				PMC_METRIC("MPKC",      op_rate, LLC_MISSES,     CYCLES_EVT, 1000000),
				PMC_METRIC("STALLS_L2", op_rate, STALLS_L2_MISS, CYCLES_EVT, 1000),
			},
			.size = NR_METRICS,
			.exp_idx = 0,
		},
	}
};
```
```c
enum event_indexes {
	INSTR_EVT = 0,
	CYCLES_EVT,
	LLC_ACCESSES,
	LLC_MISSES,
	STALLS_L2_MISS,
	L2_LINES,
	PMC_EVENT_COUNT
};

enum metric_indices {
	IPC_MET = 0,
	RPKI_MET,
	MPKI_MET,
	MPKC_MET,
	STALLS_L2_MET,
	NR_METRICS,
};
```
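Each op_rate metric above is essentially a scaled ratio of two event counts (e.g., MPKI relates LLC_MISSES to INSTR_EVT). A user-space sketch of that arithmetic, with op_rate_metric() as a hypothetical helper rather than the framework's actual implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical rate-metric helper: compute scale * numerator /
 * denominator in integer arithmetic, as fixed-point ratios of two
 * raw PMC counts. Guards against a zero denominator. */
static uint64_t op_rate_metric(uint64_t num_evt, uint64_t den_evt,
                               uint64_t scale)
{
    if (den_evt == 0)
        return 0;
    return (num_evt * scale) / den_evt;
}
```

The scale factor keeps fractional ratios (such as an IPC of 1.7) representable as integers, which matters in kernel code where floating point is unavailable.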
The counter configuration is declared in a pmcsched_counter_config_t structure, which specifies the hardware PMC mask to use, the virtual counters, and the profiling mode (TBS_SCHED_MODE for time-based sampling, or EBS_SCHED_MODE for event-based sampling):

```c
static pmcsched_counter_config_t cconfig = {
	.counter_usage = {
		.hwpmc_mask = 0x3b, /* bitops -h 0,1,3,4,5 */
		.nr_virtual_counters = CMT_MAX_EVENTS,
		.nr_experiments = 1,
		.vcounter_desc = {"llc_usage", "total_llc_bw", "local_llc_bw"},
	},
	.pmcs_descr = &pmc_configuration,
	.metric_descr = {&metric_description, NULL},
	.profiling_mode = TBS_SCHED_MODE,
};
```
The configuration is then linked into the plugin descriptor through the counter_config field, together with an on_new_sample callback (here, profile_thread_example()) that the framework invokes every time a new PMC sample is collected:

```c
sched_ops_t lfoc_plugin = {
	.policy         = SCHED_EXAMPLE_PMCs,
	.description    = "Plugin that uses PMCs",
	.counter_config = &cconfig,
	(...)
	.on_new_sample  = profile_thread_example,
};
```
```c
static int profile_thread_example(pmon_prof_t* prof, int cpu,
                                  pmc_sample_t* sample, int flags, void* data)
{
	pmcsched_thread_data_t* t = prof->monitoring_mod_priv_data;
	(...)
	t->instr_counter += sample->pmc_counts[0];
	return 0;
}
```